2 code implementations • 20 Dec 2021
Specifically, the task involves multi-hop questions that require reasoning over image-caption pairs to identify the grounded visual object being referred to and then predict a span from the news body text that answers the question.
Answer Generation • Data Augmentation • +2